fix(releases): Prevent row-lock contention on last_seen bump#115443
Conversation
…aseProjectEnvironment and ReleaseEnvironment High-volume releases cause concurrent workers to pile up on the same row's UPDATE for last_seen, hitting statement_timeout (SENTRY-5HQZ). This adds a cache-based distributed lock so only one worker per row per 60s attempts the DB update, and catches OperationalError so a failed bump doesn't abort the rest of the event save pipeline. Fixes SENTRY-5HQZ
| if cache.add(bump_key, "1", timeout=60): | ||
| try: | ||
| cls.objects.filter( | ||
| id=instance.id, last_seen__lt=datetime - timedelta(seconds=60) |
There was a problem hiding this comment.
This could probably be updated to last_seen__lt=datetime now, since the lock is doing the work for us
| if cache.add(bump_key, "1", timeout=60): | ||
| try: | ||
| cls.objects.filter( | ||
| id=instance.id, last_seen__lt=datetime - timedelta(seconds=60) |
The cache lock handles the 60s throttle now, so the SQL filter only needs to prevent setting last_seen backwards.
| if cache.add(bump_key, "1", timeout=60): | ||
| try: | ||
| cls.objects.filter(id=instance.id, last_seen__lt=datetime).update( | ||
| last_seen=datetime | ||
| ) | ||
| except OperationalError: | ||
| metric_tags["bumped"] = "error" | ||
| return instance | ||
| instance.last_seen = datetime | ||
| cache.set(cache_key, instance, 3600) | ||
| metric_tags["bumped"] = "true" | ||
| else: | ||
| metric_tags["bumped"] = "skipped" |
There was a problem hiding this comment.
This is a use-case that our buffers system can handle. We currently use buffers to handle updating last_seen times on Group and Release models. It could be used here as well. However, what you have is operationally lighter and if we don't need to capture the tail of each update set it will work well.
There was a problem hiding this comment.
oh I didnt even consider that, but yea I think pragmatically maybe its better to just keep this simple way here, and this is much less critical to get absolutely correct
| if cache.add(bump_key, "1", timeout=60): | ||
| try: | ||
| cls.objects.filter(id=instance.id, last_seen__lt=datetime).update( | ||
| last_seen=datetime | ||
| ) | ||
| except OperationalError: | ||
| metrics_tags["bumped"] = "error" | ||
| return instance | ||
| instance.last_seen = datetime | ||
| cache.set(cache_key, instance, 3600) | ||
| metrics_tags["bumped"] = "true" | ||
| else: | ||
| metrics_tags["bumped"] = "skipped" |
There was a problem hiding this comment.
With two of these (and potentially more) should we have a context manager to encapsulate this behavior?
with periodic_update(key=bump_key, timeout=60, metrics_tags=metrics_tags):
cls.objects.filter(id=instance.id, last_seen__lt=datetime).update(last_seen=datetime)There was a problem hiding this comment.
yea I wondered about the duplication, that a nce wrapper though, ill try and see if I can reduce it to something clean since it has an early return there and a cache update
The throttled last_seen bump pattern was duplicated across ReleaseEnvironment and ReleaseProjectEnvironment. Extract it into sentry/utils/last_seen.py so both models share one implementation.
Model doesn't have a last_seen attribute as far as mypy knows. Using Any avoids the attr-defined errors without needing a Protocol.
…ted rows Use a HasLastSeen Protocol to type the instance parameter instead of Any. Keep model_class as Any since typing the Django manager chain is impractical. Also restore the bumped=false metric tag when a row is newly created, which was dropped during the extraction.
Summary
UPDATE last_seenon the sameReleaseProjectEnvironment/ReleaseEnvironmentrow, triggering PostgreSQLstatement_timeoutcancellations (SENTRY-5HQZ — 6105 occurrences).OperationalErroraborts the rest of the event save pipeline (nodestore persistence, release counts, group release records), causing silent data loss.cache.add-based distributed lock so only one worker per row per 60s attempts the DB update, eliminating the thundering herd. Also catchesOperationalErroras a safety net so a failed best-effortlast_seenbump never blocks event processing.ReleaseProjectEnvironmentandReleaseEnvironmentwhich had identical vulnerable patterns.Fixes SENTRY-5HQZ
Test plan
test_bump_skipped_when_cache_lock_held— verifies second worker skips the DB update when the cache lock is heldtest_bump_survives_operational_error— verifiesOperationalErroris caught and instance is returned